Task 1 - Exploratory analysis of the data

Heart Disease Dataset

Team composition:

  • Amihaesei Sergiu
  • Stoica George

Preliminaries

In [94]:
import pandas as pd
import numpy as np
import plotly.graph_objects as go
import plotly.express as px
import plotly.io as pio
import matplotlib.pyplot as plt
import scipy.stats as stats
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import seaborn as sns

!pip install umap-learn
!pip install projection-pursuit
import umap
from skpp import ProjectionPursuitRegressor

# When saving to HTML, use the "notebook" renderer ("colab" is the default)
pio.renderers.default = "notebook"
In [95]:
data = pd.read_csv("./heart.csv")

Checking data consistency:

In [96]:
data.isna().any()
Out[96]:
age         False
sex         False
cp          False
trestbps    False
chol        False
fbs         False
restecg     False
thalach     False
exang       False
oldpeak     False
slope       False
ca          False
thal        False
target      False
dtype: bool
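
No values are missing. Two more quick consistency checks could be run on the same frame (a sketch, not part of the original cells):

data.dtypes               # every column should already be numeric
data.duplicated().sum()   # count exact duplicate rows (relevant later, for the Sammon mapping)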

Univariate Analysis

Central tendency & Spread

Mode - represents the most frequent value. Since it is not included in Pandas' describe, we had to call it separately. It returns multiple rows if there are multiple modes, hence we only took the first row.

In [97]:
# note: cp and fbs are ordinal/binary codes, but we keep them in the numeric summaries
numerical_data = data[["age", "cp", "trestbps", "chol", "fbs", "thalach"]]
numerical_data.mode().iloc[0]
Out[97]:
age          58.0
cp            0.0
trestbps    120.0
chol        197.0
fbs           0.0
thalach     162.0
Name: 0, dtype: float64

Describe offers the main descriptive statistics: count, mean, standard deviation, the range (min and max) and the quartiles, where the 50% row is the median.

In [98]:
numerical_data.describe()
Out[98]:
age cp trestbps chol fbs thalach
count 303.000000 303.000000 303.000000 303.000000 303.000000 303.000000
mean 54.366337 0.966997 131.623762 246.264026 0.148515 149.646865
std 9.082101 1.032052 17.538143 51.830751 0.356198 22.905161
min 29.000000 0.000000 94.000000 126.000000 0.000000 71.000000
25% 47.500000 0.000000 120.000000 211.000000 0.000000 133.500000
50% 55.000000 1.000000 130.000000 240.000000 0.000000 153.000000
75% 61.000000 2.000000 140.000000 274.500000 0.000000 166.000000
max 77.000000 3.000000 200.000000 564.000000 1.000000 202.000000

Form of the distribution

Skewness measures the asymmetry of a distribution:

  • When the skewness is zero, the distribution is symmetric, like the normal distribution.
  • When the skewness is negative, the tail of the distribution is longer towards the left-hand side of the curve.
  • When the skewness is positive, the tail of the distribution is longer towards the right-hand side of the curve.
  • In our dataset, fbs has the largest positive skewness, while sex has the most negative one.

    The variable sex is strongly skewed to the left; since it is categorical, this means there are more males than females in the dataset. The skewness of age is close to 0, which implies its distribution is roughly symmetric, close to normal. For the rest of the variables, the skewness is more visible in the density plots.

In [99]:
data.skew()
Out[99]:
age        -0.202463
sex        -0.791335
cp          0.484732
trestbps    0.713768
chol        1.143401
fbs         1.986652
restecg     0.162522
thalach    -0.537410
exang       0.742532
oldpeak     1.269720
slope      -0.508316
ca          1.310422
thal       -0.476722
target     -0.179821
dtype: float64
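
As a sanity check, pandas' bias-corrected skewness can be reproduced with scipy on a single column (a sketch):

stats.skew(data["chol"], bias=False)   # ~1.1434, matching data.skew()["chol"]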

Kurtosis measures the tailedness of a distribution (pandas reports excess kurtosis, for which the normal distribution scores 0):

  • High kurtosis in a data set is an indicator that the data has heavy outliers.
  • Low kurtosis in a data set is an indicator that the data lacks outliers.
  • Among the dataset variables, chol has heavy outliers, while thalach lacks them.

    The variable chol, which is the level of cholesterol, has heavy outliers. The variables thalach, thal, and age don't have that many outliers.

In [100]:
data.kurtosis()
Out[100]:
age        -0.542167
sex        -1.382961
cp         -1.193071
trestbps    0.929054
chol        4.505423
fbs         1.959678
restecg    -1.362673
thalach    -0.061970
exang      -1.458317
oldpeak     1.575813
slope      -0.627521
ca          0.839253
thal        0.297915
target     -1.980783
dtype: float64
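
The scipy counterpart, as a sketch (it computes the same bias-corrected excess kurtosis as pandas):

stats.kurtosis(data["chol"], fisher=True, bias=False)   # ~4.5054, matching data.kurtosis()["chol"]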

Frequencies for categorical attributes

We can see that sex is indeed heavily imbalanced, as are fbs and exang. Some attributes, such as restecg, ca and thal, contain very rare categories that behave like outliers.

In [101]:
categorical_variables = ["sex", "fbs", "restecg", "exang", "cp", "slope", "ca", "thal"]

for i in categorical_variables:
  print(data[i].value_counts())
1    207
0     96
Name: sex, dtype: int64
0    258
1     45
Name: fbs, dtype: int64
1    152
0    147
2      4
Name: restecg, dtype: int64
0    204
1     99
Name: exang, dtype: int64
0    143
2     87
1     50
3     23
Name: cp, dtype: int64
2    142
1    140
0     21
Name: slope, dtype: int64
0    175
1     65
2     38
3     20
4      5
Name: ca, dtype: int64
2    166
3    117
1     18
0      2
Name: thal, dtype: int64

Graphs

Density

Here we can see the shapes whose skewness and kurtosis we quantified above.

  • We can observe that there are more men than women and that the majority of the subjects are between 50 and 60 years old.
In [102]:
nrow = 7
ncol = 2

fig, axes = plt.subplots(nrow, ncol, constrained_layout=True)

count = 0
for r in range(nrow):
  for c in range(ncol):
    title = 'Attribute "' + data.columns[count] + '" density plot'
    data.iloc[:, count].plot.density(ax=axes[r,c], title=title, figsize=(10,20))

    count += 1

Frequency

Shows how the attributes' values are distributed

In [103]:
nrow = 7
ncol = 2

fig, axes = plt.subplots(nrow, ncol, constrained_layout=True)

count = 0
for r in range(nrow):
  for c in range(ncol):
    title = 'Attribute "' + data.columns[count] + '" frequency plot'
    data.iloc[:, count].plot.hist(ax=axes[r,c], title=title, figsize=(10,20))

    count += 1

Boxplot

The boxplot shows the minimum, first quartile (25% of the data), median (50% of the data), third quartile (75% of the data) and the maximum, where minimum and maximum denote the whisker ends, at most $1.5 \cdot IQR$ beyond the quartiles. The values outside of $[minimum, maximum]$ are plotted as points and are to be considered outliers.

In [104]:
nrow = 7
ncol = 2

fig, axes = plt.subplots(nrow, ncol, constrained_layout=True)

count = 0
for r in range(nrow):
  for c in range(ncol):
    title = 'Attribute "' + data.columns[count] + '" boxplot'
    data.iloc[:, count].plot.box(ax=axes[r,c], title=title, figsize=(10,20))

    count += 1

Conclusion

  • Attributes thal, ca, oldpeak, thalach, fbs, chol and trestbps have outliers
  • The categorical attributes are not uniformly distributed

Bivariate / Multivariate Analysis

Pearson correlation

  • Measures the strength of the linear dependency between two variables, as formalized below.
  • Values closer to 1 indicate a direct correlation between attributes.
  • Values closer to -1 indicate an inverse correlation between attributes.
  • Values closer to 0 indicate no linear correlation between attributes.
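
Formally, for two attributes $X$ and $Y$ with standard deviations $\sigma_X$ and $\sigma_Y$:

$$r_{XY} = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y}$$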
In [105]:
numerical_data.corr(method="pearson").style.background_gradient(cmap='Blues')
Out[105]:
  age cp trestbps chol fbs thalach
age 1.000000 -0.068653 0.279351 0.213678 0.121308 -0.398522
cp -0.068653 1.000000 0.047608 -0.076904 0.094444 0.295762
trestbps 0.279351 0.047608 1.000000 0.123174 0.177531 -0.046698
chol 0.213678 -0.076904 0.123174 1.000000 0.013294 -0.009940
fbs 0.121308 0.094444 0.177531 0.013294 1.000000 -0.008567
thalach -0.398522 0.295762 -0.046698 -0.009940 -0.008567 1.000000

Spearman correlation

  • Unlike Pearson, it also captures monotonic relationships that are not linear, because it operates on ranks (see the sketch below).
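
As a quick equivalence sketch, Spearman's matrix is just Pearson's computed on ranks:

numerical_data.rank().corr(method="pearson")   # matches the Spearman matrix below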
In [106]:
numerical_data.corr(method="spearman").style.background_gradient(cmap='Blues')
Out[106]:
  age cp trestbps chol fbs thalach
age 1.000000 -0.087494 0.285617 0.195786 0.113978 -0.398052
cp -0.087494 1.000000 0.035413 -0.091721 0.089775 0.324013
trestbps 0.285617 0.035413 1.000000 0.126562 0.151984 -0.040407
chol 0.195786 -0.091721 0.126562 1.000000 0.018463 -0.046766
fbs 0.113978 0.089775 0.151984 0.018463 1.000000 -0.014273
thalach -0.398052 0.324013 -0.040407 -0.046766 -0.014273 1.000000

Kendall correlation

  • Has a smaller gross error sensitivity and a smaller asymptotic variance than Spearman
In [107]:
numerical_data.corr(method="kendall").style.background_gradient(cmap='Blues')
Out[107]:
  age cp trestbps chol fbs thalach
age 1.000000 -0.071577 0.201071 0.135062 0.094595 -0.280009
cp -0.071577 1.000000 0.027548 -0.069899 0.083862 0.246160
trestbps 0.201071 0.027548 1.000000 0.086474 0.127574 -0.027760
chol 0.135062 -0.069899 0.086474 1.000000 0.015140 -0.031437
fbs 0.094595 0.083862 0.127574 0.015140 1.000000 -0.011749
thalach -0.280009 0.246160 -0.027760 -0.031437 -0.011749 1.000000

From the above correlation matrices we can observe a moderate positive correlation between thalach and cp and a moderate inverse correlation between thalach and age.

Independence tests

Chi-Square Test for Independence

  • Used to examine whether two categorical variables are independent, by comparing the observed frequencies with the frequencies expected under independence.
  • The p-value tells whether the test is significant or not.
  • A p-value over 0.05 means the test does not reject independence at that level.
  • The $\chi^2$ statistic summarizes the discrepancy between the observed and the expected frequencies; a single-pair sketch follows.
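
For a single pair, the whole procedure reduces to a few lines (a sketch; stats.chi2_contingency also returns the degrees of freedom and the expected frequencies):

ct = pd.crosstab(data["sex"], data["target"])          # observed frequencies
chi2, p, ddof, expected = stats.chi2_contingency(ct)   # expected = frequencies under independence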
In [108]:
alpha = 0.05

categorical_variables = ["sex", "fbs", "restecg", "exang", "cp", "slope", "ca", "thal", "target"]
significant = []

for i in range(len(categorical_variables)):
    for j in range(i + 1, len(categorical_variables)):
        # contingency table of observed counts (no margins needed)
        ct = pd.crosstab(data[categorical_variables[i]], data[categorical_variables[j]])
        chi2, p, _, _ = stats.chi2_contingency(ct)
        print(f"Independence between {categorical_variables[i]:<7} and {categorical_variables[j]:<7}. Chi-squared test: ")
        print(f"chi2: {chi2:.3f}, p-value: {p:.3f}")
        print(f"{'Failed, p-value bigger than ' + str(alpha) if p > alpha else 'Relevant'}\n")
        if p < 0.0001:
            significant.append([categorical_variables[i], categorical_variables[j]])

print("Significant pairs")
for a, b in significant:
    print(a, b)
Independence between sex     and fbs    . Chi-squared test: 
chi2: 0.372, p-value: 0.542
Failed, p-value bigger than 0.05

Independence between sex     and restecg. Chi-squared test: 
chi2: 3.697, p-value: 0.157
Failed, p-value bigger than 0.05

Independence between sex     and exang  . Chi-squared test: 
chi2: 5.449, p-value: 0.020
Relevant

Independence between sex     and cp     . Chi-squared test: 
chi2: 6.822, p-value: 0.078
Failed, p-value bigger than 0.05

Independence between sex     and slope  . Chi-squared test: 
chi2: 0.648, p-value: 0.723
Failed, p-value bigger than 0.05

Independence between sex     and ca     . Chi-squared test: 
chi2: 7.848, p-value: 0.097
Failed, p-value bigger than 0.05

Independence between sex     and thal   . Chi-squared test: 
chi2: 44.626, p-value: 0.000
Relevant

Independence between sex     and target . Chi-squared test: 
chi2: 22.717, p-value: 0.000
Relevant

Independence between fbs     and restecg. Chi-squared test: 
chi2: 2.297, p-value: 0.317
Failed, p-value bigger than 0.05

Independence between fbs     and exang  . Chi-squared test: 
chi2: 0.075, p-value: 0.784
Failed, p-value bigger than 0.05

Independence between fbs     and cp     . Chi-squared test: 
chi2: 3.886, p-value: 0.274
Failed, p-value bigger than 0.05

Independence between fbs     and slope  . Chi-squared test: 
chi2: 3.373, p-value: 0.185
Failed, p-value bigger than 0.05

Independence between fbs     and ca     . Chi-squared test: 
chi2: 7.356, p-value: 0.118
Failed, p-value bigger than 0.05

Independence between fbs     and thal   . Chi-squared test: 
chi2: 5.542, p-value: 0.136
Failed, p-value bigger than 0.05

Independence between fbs     and target . Chi-squared test: 
chi2: 0.106, p-value: 0.744
Failed, p-value bigger than 0.05

Independence between restecg and exang  . Chi-squared test: 
chi2: 2.976, p-value: 0.226
Failed, p-value bigger than 0.05

Independence between restecg and cp     . Chi-squared test: 
chi2: 9.688, p-value: 0.138
Failed, p-value bigger than 0.05

Independence between restecg and slope  . Chi-squared test: 
chi2: 10.947, p-value: 0.027
Relevant

Independence between restecg and ca     . Chi-squared test: 
chi2: 10.014, p-value: 0.264
Failed, p-value bigger than 0.05

Independence between restecg and thal   . Chi-squared test: 
chi2: 3.526, p-value: 0.740
Failed, p-value bigger than 0.05

Independence between restecg and target . Chi-squared test: 
chi2: 10.023, p-value: 0.007
Relevant

Independence between exang   and cp     . Chi-squared test: 
chi2: 67.348, p-value: 0.000
Relevant

Independence between exang   and slope  . Chi-squared test: 
chi2: 25.131, p-value: 0.000
Relevant

Independence between exang   and ca     . Chi-squared test: 
chi2: 12.809, p-value: 0.012
Relevant

Independence between exang   and thal   . Chi-squared test: 
chi2: 32.959, p-value: 0.000
Relevant

Independence between exang   and target . Chi-squared test: 
chi2: 55.945, p-value: 0.000
Relevant

Independence between cp      and slope  . Chi-squared test: 
chi2: 27.747, p-value: 0.000
Relevant

Independence between cp      and ca     . Chi-squared test: 
chi2: 33.970, p-value: 0.001
Relevant

Independence between cp      and thal   . Chi-squared test: 
chi2: 41.892, p-value: 0.000
Relevant

Independence between cp      and target . Chi-squared test: 
chi2: 81.686, p-value: 0.000
Relevant

Independence between slope   and ca     . Chi-squared test: 
chi2: 11.501, p-value: 0.175
Failed, p-value bigger than 0.05

Independence between slope   and thal   . Chi-squared test: 
chi2: 35.283, p-value: 0.000
Relevant

Independence between slope   and target . Chi-squared test: 
chi2: 47.507, p-value: 0.000
Relevant

Independence between ca      and thal   . Chi-squared test: 
chi2: 23.639, p-value: 0.023
Relevant

Independence between ca      and target . Chi-squared test: 
chi2: 74.367, p-value: 0.000
Relevant

Independence between thal    and target . Chi-squared test: 
chi2: 85.304, p-value: 0.000
Relevant

Significant pairs
sex thal
sex target
exang cp
exang slope
exang thal
exang target
cp thal
cp target
slope thal
slope target
ca target
thal target

Fisher's Test

  • Used in the analysis of contingency tables.
  • scipy's implementation works only on 2x2 tables, i.e. pairs of binary variables.
  • The p-values illustrate the significance of the test.
  • The ratio is the odds ratio estimated from the table; a single-pair sketch follows.
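
A single-pair sketch on a 2x2 crosstab:

ct = pd.crosstab(data["sex"], data["target"])   # 2x2 table of observed counts
odds_ratio, p = stats.fisher_exact(ct)          # two-sided test by default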
In [109]:
alpha = 0.05

binary_variables = ["sex", "fbs", "exang", "target"]
significant = []

for i in range(len(binary_variables)):
    for j in range(i + 1, len(binary_variables)):
        ct = pd.crosstab(data[binary_variables[i]], data[binary_variables[j]])  # 2x2 table
        ratio, p2 = stats.fisher_exact(ct)
        print(f"Independence between {binary_variables[i]:<7} and {binary_variables[j]:<7}. Fisher's exact test:")
        print(f"ratio: {ratio:.3f}, p-value: {p2:.3f}")
        print(f"{'Failed, p-value bigger than ' + str(alpha) if p2 > alpha else 'Relevant'}\n")
        if p2 < 0.0001:
            significant.append([binary_variables[i], binary_variables[j]])

print("Significant pairs")
for a, b in significant:
    print(a, b)
Independence between sex     and fbs    . Fisher's exact test:
ratio: 1.328, p-value: 0.491
Failed, p-value bigger than 0.05

Independence between sex     and exang  . Fisher's exact test:
ratio: 1.992, p-value: 0.017
Relevant

Independence between sex     and target . Fisher's exact test:
ratio: 0.272, p-value: 0.000
Relevant

Independence between fbs     and exang  . Fisher's exact test:
ratio: 1.163, p-value: 0.731
Failed, p-value bigger than 0.05

Independence between fbs     and target . Fisher's exact test:
ratio: 0.854, p-value: 0.631
Failed, p-value bigger than 0.05

Independence between exang   and target . Fisher's exact test:
ratio: 0.132, p-value: 0.000
Relevant

Significant pairs
sex target
exang target

T-test for Independence

  • Mostly used for identifying a statistically significant difference between the means of 2 groups.
  • The t-values quantify the difference between the arithmetic means of the two samples.
  • We split each numerical variable into 2 groups according to each binary variable; a Welch sketch follows the list.
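
stats.ttest_ind assumes equal group variances by default; Welch's variant is a common robustness check when the variances differ (a sketch, not used for the results below):

group_1 = data["age"][data["target"] == 1]
group_2 = data["age"][data["target"] == 0]
t, p = stats.ttest_ind(group_1, group_2, equal_var=False)   # Welch's t-test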
In [110]:
alpha = 0.05
binary_variables = ["sex", "fbs", "exang", "target"]
significant = []

for column in numerical_data.columns:
    for attribute in binary_variables:
        if column != attribute:
            group_1 = data[column][data[attribute] == 1]
            group_2 = data[column][data[attribute] == 0]
            t, p = stats.ttest_ind(group_1, group_2)
            print(f"Variable {column:<8} separated by {attribute:<6} has t-value: {t:.3f} and p-value: {p:.3f}.")
            print(f"{'Failed, p-value bigger than ' + str(alpha) if p > alpha else 'Relevant'}\n")
            # keep the significance check inside the branch so a stale p
            # from the previous iteration is never reused
            if p < 0.0001:
                significant.append([column, attribute])

print("Significant pairs")
for a, b in significant:
    print(a, b)
Variable age      separated by sex    has t-value: -1.716 and p-value: 0.087.
Failed, p-value bigger than 0.05

Variable age      separated by fbs    has t-value: 2.120 and p-value: 0.035.
Relevant

Variable age      separated by exang  has t-value: 1.687 and p-value: 0.093.
Failed, p-value bigger than 0.05

Variable age      separated by target has t-value: -4.015 and p-value: 0.000.
Relevant

Variable cp       separated by sex    has t-value: -0.857 and p-value: 0.392.
Failed, p-value bigger than 0.05

Variable cp       separated by fbs    has t-value: 1.646 and p-value: 0.101.
Failed, p-value bigger than 0.05

Variable cp       separated by exang  has t-value: -7.444 and p-value: 0.000.
Relevant

Variable cp       separated by target has t-value: 8.353 and p-value: 0.000.
Relevant

Variable trestbps separated by sex    has t-value: -0.986 and p-value: 0.325.
Failed, p-value bigger than 0.05

Variable trestbps separated by fbs    has t-value: 3.130 and p-value: 0.002.
Relevant

Variable trestbps separated by exang  has t-value: 1.176 and p-value: 0.241.
Failed, p-value bigger than 0.05

Variable trestbps separated by target has t-value: -2.541 and p-value: 0.012.
Relevant

Variable chol     separated by sex    has t-value: -3.503 and p-value: 0.001.
Relevant

Variable chol     separated by fbs    has t-value: 0.231 and p-value: 0.818.
Failed, p-value bigger than 0.05

Variable chol     separated by exang  has t-value: 1.165 and p-value: 0.245.
Failed, p-value bigger than 0.05

Variable chol     separated by target has t-value: -1.484 and p-value: 0.139.
Failed, p-value bigger than 0.05

Variable fbs      separated by sex    has t-value: 0.782 and p-value: 0.435.
Failed, p-value bigger than 0.05

Variable fbs      separated by exang  has t-value: 0.445 and p-value: 0.656.
Failed, p-value bigger than 0.05

Variable fbs      separated by target has t-value: -0.487 and p-value: 0.627.
Failed, p-value bigger than 0.05

Variable thalach  separated by sex    has t-value: -0.764 and p-value: 0.445.
Failed, p-value bigger than 0.05

Variable thalach  separated by fbs    has t-value: -0.149 and p-value: 0.882.
Failed, p-value bigger than 0.05

Variable thalach  separated by exang  has t-value: -7.101 and p-value: 0.000.
Relevant

Variable thalach  separated by target has t-value: 8.070 and p-value: 0.000.
Relevant

Significant pairs
age target
cp exang
cp target
thalach exang
thalach target

Visualization

Scatterplot on 2-variable combinations

If two variables are correlated, we should see a visible pattern in the point cloud.

In [111]:
fig = go.Figure()

axis_labels = []

count = 0
for i in range(len(data.columns)):
    for j in range(i + 1, len(data.columns)):
      count += 1

      fig2 = px.scatter(data, x=data.iloc[:, i],
                              y=data.iloc[:, j],
                              color="target")

      fig.add_trace(fig2.data[0])

      axis_labels.append([data.columns[i], data.columns[j]])

fig.data[count - 1].visible = True

steps = []
for i in range(len(fig.data)):
    step = dict(
        method="update",
        args=[{"visible": [False] * len(fig.data)},
              {"title": "Slider switched to step: " + str(i),
               "xaxis.title": axis_labels[i][0],
               "yaxis.title": axis_labels[i][1]
               }],
    )
    step["args"][0]["visible"][i] = True
    steps.append(step)

sliders = [dict(
    active=count - 1,
    currentvalue={"prefix": "Frequency: "},
    pad={"t": 50},
    steps=steps
)]

fig.update_layout(
    sliders=sliders,
    plot_bgcolor='rgba(100,100,100,1)'  # rgba alpha must be in [0, 1]
)

fig.show()

3D Graphs

In [112]:
variables = ["age", "trestbps", "chol", "thalach", "oldpeak"]

axis_labels = []

fig = go.Figure()

count = 0
for i in range(len(variables)):
    for j in range(i + 1, len(variables)):
        for k in range(j + 1, len(variables)):
          count += 1

          fig2 = px.scatter_3d(data, x=variables[i],
                              y=variables[j],
                              z=variables[k],
                              color="target")

          fig.add_trace(fig2.data[0])

          axis_labels.append([variables[i], variables[j], variables[k]])

fig.data[count - 1].visible = True

steps = []
for i in range(len(fig.data)):
    step = dict(
        method="update",
        args=[{"visible": [False] * len(fig.data)},
              {"title": "Slider switched to step: " + str(i),
               "scene.xaxis.title": axis_labels[i][0],
               "scene.yaxis.title": axis_labels[i][1],
               "scene.zaxis.title": axis_labels[i][2],
               }],
    )
    step["args"][0]["visible"][i] = True
    steps.append(step)

sliders = [dict(
    active=count - 1,
    currentvalue={"prefix": "Frequency: "},
    pad={"t": 50},
    steps=steps
)]

fig.update_layout(
    sliders=sliders
)

fig.show()

Principal Components Analysis

  • PCA is used to reduce dimensionality and increase interpretability while minimizing information loss.
  • Using the first 3 principal components we would explain more than 40% of the variance.
In [113]:
# z-score standardization so every attribute contributes on a comparable scale
normalized_data = data.copy()
normalized_data = normalized_data.drop(columns=["target"])
normalized_data = (normalized_data - normalized_data.mean()) / normalized_data.std()

pca = PCA(n_components=normalized_data.shape[1])
pca.fit(normalized_data)

%matplotlib inline
plt.plot(pca.explained_variance_ratio_)
plt.ylabel('Explained Variance')
plt.xlabel('Components')
plt.xticks([i for i in range(len(pca.explained_variance_ratio_))])
plt.show()
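
The cumulative explained variance makes the claim above explicit:

np.cumsum(pca.explained_variance_ratio_)   # the third entry is ~0.42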

Using 3 principal components:

In [114]:
pca3 = PCA(n_components=3)
new_data_3 = pca3.fit_transform(normalized_data)

new_data_3 = pd.DataFrame(data = new_data_3, columns = ['PCA 1', 'PCA 2', 'PCA 3'])

print(f'Explained variation per principal component: {pca3.explained_variance_ratio_}. Total: {sum(pca3.explained_variance_ratio_)}')

pca3_by_variables = pd.DataFrame(pca3.components_, columns = data.drop(columns="target").columns)
Explained variation per principal component: [0.21254053 0.11820708 0.09406418]. Total: 0.4248117853438838

How much each variable contributes to the principal components:

In [115]:
pca3_by_variables.T.index
Out[115]:
Index(['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach',
       'exang', 'oldpeak', 'slope', 'ca', 'thal'],
      dtype='object')

Using 5 principal components:

In [116]:
pca5 = PCA(n_components=5)
new_data_5 = pca5.fit_transform(normalized_data)

new_data_5 = pd.DataFrame(data = new_data_5, columns = ['PCA 1', 'PCA 2', 'PCA 3', 'PCA 4', 'PCA 5'])

print(f'Explained variation per principal component: {pca5.explained_variance_ratio_}. Total: {sum(pca5.explained_variance_ratio_)}')

pca_by_variables = pd.DataFrame(pca5.components_, columns = data.drop(columns="target").columns)
Explained variation per principal component: [0.21254053 0.11820708 0.09406418 0.09085735 0.07861281]. Total: 0.59428194322653

How much each variable contributes to the principal components:

In [117]:
pca_by_variables
Out[117]:
age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
0 0.314203 0.090838 -0.274607 0.183920 0.117375 0.073640 -0.127728 -0.416498 0.361267 0.419639 -0.379772 0.273262 0.222024
1 0.406149 -0.377792 0.297266 0.438187 0.364514 0.317433 -0.220882 0.077876 -0.263118 -0.052255 0.048374 0.094147 -0.200720
2 -0.094077 0.554849 0.356974 0.203849 -0.407825 0.481736 -0.089191 0.158255 -0.126356 0.110343 -0.073818 0.183569 0.125011
3 -0.020662 -0.255309 0.287900 0.022601 -0.343410 -0.068605 0.266096 -0.184125 -0.115056 0.326296 -0.494849 -0.328016 -0.389191
4 -0.307153 0.050704 0.163179 0.188138 0.320067 -0.233442 -0.393667 0.323284 0.034536 0.250579 -0.246823 -0.435365 0.331950
In [118]:
new_data_3["target"] = data["target"]

fig = px.scatter_3d(new_data_3, x=new_data_3.columns[0], y=new_data_3.columns[1], z=new_data_3.columns[2], color='target')
# draw each variable's loading vector, scaled by 10 for visibility
for attribute in pca3_by_variables.T.index:
    x = np.array([0, pca3_by_variables[attribute][0]]) * 10
    y = np.array([0, pca3_by_variables[attribute][1]]) * 10
    z = np.array([0, pca3_by_variables[attribute][2]]) * 10
    fig.add_trace(go.Scatter3d(x=x, y=y, z=z, mode='lines', name=attribute))

fig.update_layout(legend=dict(
    orientation="h",
    yanchor="bottom",
    y=1.02,
    xanchor="right",
    x=1
))
fig.show()

Non-linear bi-dimensional mappings

Non-linear mappings transform the data from a multi-dimensional space to a two-dimensional one.

Sammon

None of the common Python libraries includes Sammon mapping, so we used the function from this repo.
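
Sammon mapping looks for low-dimensional points whose pairwise distances $d_{ij}$ preserve the original distances $d^{*}_{ij}$ as well as possible, minimizing the stress

$$E = \frac{1}{\sum_{i<j} d^{*}_{ij}} \sum_{i<j} \frac{(d^{*}_{ij} - d_{ij})^{2}}{d^{*}_{ij}}$$

which is the quantity E printed per epoch below.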

In [119]:
def sammon(x, n, display=2, inputdist='raw', maxhalves=20, maxiter=500, tolfun=1e-9, init='default'):
    from scipy.spatial.distance import cdist

    # Build the target dissimilarity matrix D
    if inputdist == 'distance':
        D = x
        if init == 'default':
            init = 'cmdscale'
    else:
        D = cdist(x, x)
        if init == 'default':
            init = 'pca'

    if inputdist == 'distance' and init == 'pca':
        raise ValueError("Cannot use init == 'pca' when inputdist == 'distance'")

    if np.count_nonzero(np.diagonal(D)) > 0:
        raise ValueError("The diagonal of the dissimilarity matrix must be zero")

    N = x.shape[0]
    scale = 0.5 / D.sum()
    D = D + np.eye(N)  # avoid division by zero on the diagonal

    if np.count_nonzero(D <= 0) > 0:
        raise ValueError("Off-diagonal dissimilarities must be strictly positive")

    Dinv = 1 / D

    # Initialise the low-dimensional configuration y;
    # anything other than 'pca' falls back to a random initialisation
    if init == 'pca':
        [UU, DD, _] = np.linalg.svd(x)
        y = UU[:, :n] * DD[:n]
    else:
        y = np.random.normal(0.0, 1.0, [N, n])

    one = np.ones([N, n])
    d = cdist(y, y) + np.eye(N)
    dinv = 1. / d
    delta = D - d
    E = ((delta ** 2) * Dinv).sum()  # unscaled Sammon stress

    for i in range(maxiter):
        # Newton-like step on the stress: gradient g and diagonal Hessian H
        delta = dinv - Dinv
        deltaone = np.dot(delta, one)
        g = np.dot(delta, y) - (y * deltaone)
        dinv3 = dinv ** 3
        y2 = y ** 2
        H = np.dot(dinv3, y2) - deltaone - 2 * y * np.dot(dinv3, y) + y2 * np.dot(dinv3, one)
        s = -g.flatten(order='F') / np.abs(H.flatten(order='F'))
        y_old = y

        # Step halving until the stress decreases
        for j in range(maxhalves):
            s_reshape = np.reshape(s, (-1, n), order='F')
            y = y_old + s_reshape
            d = cdist(y, y) + np.eye(N)
            dinv = 1 / d
            delta = D - d
            E_new = ((delta ** 2) * Dinv).sum()
            if E_new < E:
                break
            else:
                s = 0.5 * s

        if j == maxhalves - 1:
            print('Warning: maxhalves exceeded. Sammon mapping may not converge...')

        if abs((E - E_new) / E) < tolfun:
            if display:
                print('TolFun exceeded: Optimisation terminated')
            break

        E = E_new
        if display > 1:
            print('epoch = %d : E = %12.10f' % (i + 1, E * scale))

    if i == maxiter - 1:
        print('Warning: maxiter exceeded. Sammon mapping may not have converged...')

    E = E * scale

    return [y, E]
In [120]:
# Drop duplicate rows: identical points yield zero off-diagonal distances,
# which the Sammon mapping above rejects
data_matrix = data.to_numpy()[:, :-1]
x, index = np.unique(data_matrix, axis=0, return_index=True)
target = data.to_numpy()[:, -1]
target = target[index]

y, E = sammon(x, 2)

fig = go.Figure()
fig.add_trace(go.Scatter(x=y[target==0, 0], y=y[target==0, 1],
                    mode='markers',
                    name='Negative Heart Disease'))
fig.add_trace(go.Scatter(x=y[target==1, 0], y=y[target==1, 1],
                    mode='markers',
                    name='Positive Heart Disease'))

fig.update_layout(title='Sammon Mapping')

fig.show()
epoch = 1 : E = 0.0181604779
epoch = 2 : E = 0.0176883499
epoch = 3 : E = 0.0142568584
...
epoch = 209 : E = 0.0093454665
epoch = 210 : E = 0.0093454664
TolFun exceeded: Optimisation terminated

t-SNE
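
t-SNE places the points so that neighbour probabilities in the 2-D embedding match those in the original space, which mostly preserves local structure; sklearn's perplexity parameter (default 30) sets the effective neighbourhood size.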

In [121]:
y = TSNE(n_components=2, learning_rate='auto', init="pca").fit_transform(x)

fig = go.Figure()
fig.add_trace(go.Scatter(x=y[target==0, 0], y=y[target==0, 1],
                    mode='markers',
                    name='Negative Heart Disease'))
fig.add_trace(go.Scatter(x=y[target==1, 0], y=y[target==1, 1],
                    mode='markers',
                    name='Positive Heart Disease'))

fig.update_layout(title='t-SNE Mapping')

fig.show()
/usr/local/lib/python3.7/dist-packages/sklearn/manifold/_t_sne.py:986: FutureWarning:

The PCA initialization in TSNE will change to have the standard deviation of PC1 equal to 1e-4 in 1.2. This will ensure better convergence.

UMAP
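
UMAP similarly builds a weighted nearest-neighbour graph and optimizes a 2-D layout of it; n_neighbors (default 15) and min_dist (default 0.1) are its main knobs.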

In [122]:
# umap.UMAP also exposes n_neighbors, min_dist, spread, metric and learning_rate
y = umap.UMAP().fit_transform(x)

fig = go.Figure()
fig.add_trace(go.Scatter(x=y[target==0, 0], y=y[target==0, 1],
                    mode='markers',
                    name='Negative Heart Disease'))
fig.add_trace(go.Scatter(x=y[target==1, 0], y=y[target==1, 1],
                    mode='markers',
                    name='Positive Heart Disease'))

fig.update_layout(title='UMAP Mapping')

fig.show()

Projection Pursuit

Available in the other document

Conditional boxplots

Plotting every numerical variable against each categorical one as a boxplot. If the boxplots' notches do not overlap, there is a statistically significant difference between the medians (we enable notches in the code below).
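
For reference, the notch around each median typically spans (McGill's rule of thumb, which we assume plotly follows):

$$\mathrm{median} \pm 1.57 \cdot \frac{IQR}{\sqrt{n}}$$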

In [123]:
numerical_variables = ["age", "trestbps", "chol", "thalach", "oldpeak"]
categorical_variables = ["sex", "fbs", "restecg", "exang", "cp", "slope", "ca", "thal", "target"]
axis_labels = []

fig = go.Figure()

count = 0

for nv in numerical_variables:
  for cv in categorical_variables:
    count += 1

    axis_labels.append([cv, nv])
    fig.add_trace(go.Box(x=data[cv], y=data[nv], notched=True))

fig.data[count - 1].visible = True

steps = []
for i in range(len(fig.data)):
    step = dict(
        method="update",
        args=[{"visible": [False] * len(fig.data)},
              {"title": "Slider switched to step: " + str(i),
               "xaxis.title": axis_labels[i][0],
               "yaxis.title": axis_labels[i][1]
               }],
    )
    step["args"][0]["visible"][i] = True
    steps.append(step)

sliders = [dict(
    active=count - 1,
    currentvalue={"prefix": "Frequency: "},
    pad={"t": 50},
    steps=steps
)]

fig.update_layout(
    sliders=sliders
)

fig.show()

We can see that:

  • the age medians split by sex, fbs, exang are (almost) overlapping;
  • likewise for trestbps split by sex, exang, slope, target;
  • for chol split by sex, fbs, exang, cp, slope;
  • for thalach split by fbs;
  • and for oldpeak split by sex, fbs.

Overlay Histogram

Allows us to visualize and compare the distributions of a numeric attribute across the categories, superimposed on each other.

In [124]:
numerical_variables = ["age", "trestbps", "chol", "thalach", "oldpeak"]
categorical_variables = ["sex", "fbs", "restecg", "exang", "cp", "slope", "ca", "thal", "target"]
axis_labels = []
show_traces = []

fig = go.Figure()

count = 0

for nv in numerical_variables:
  for cv in categorical_variables:
    fig2 = px.histogram(data, x=nv, color=cv)
    traces_to_show = []

    for i in range(len(fig2.data)):
      traces_to_show.append(count)
      count += 1

      histogram = fig2.data[i]
      histogram.name = cv + " " + str(histogram.name)
      histogram.opacity = 0.8
      fig.add_trace(histogram)

    show_traces.append(traces_to_show)
    axis_labels.append([nv, cv])

steps = []
for i in range(len(show_traces)):
    step = dict(
        method="update",
        args=[{"visible": [False] * len(fig.data)},
              {"title": "Showing in relation to " + axis_labels[i][1],
               "xaxis.title": axis_labels[i][0],
               "yaxis.title": "count"
               }],
    )

    for j in show_traces[i]:
      step["args"][0]["visible"][j] = True

    steps.append(step)

sliders = [dict(
    active=count - 1,
    currentvalue={"prefix": "Frequency: "},
    pad={"t": 50},
    steps=steps
)]

fig.update_layout(
    sliders=sliders,
    barmode="overlay"
)

fig.show()

Stacked Histogram

Used to determine the relative decomposition of an attribute according to a categorical variable; this way, we can see the contribution of each group to the overall count.

In [125]:
numerical_variables = ["age", "trestbps", "chol", "thalach", "oldpeak"]
categorical_variables = ["sex", "fbs", "restecg", "exang", "cp", "slope", "ca", "thal", "target"]
axis_labels = []
show_traces = []

fig = go.Figure()

count = 0

for nv in numerical_variables:
  for cv in categorical_variables:
    fig2 = px.histogram(data, x=nv, color=cv)
    traces_to_show = []

    for i in range(len(fig2.data)):
      traces_to_show.append(count)
      count += 1

      histogram = fig2.data[i]
      histogram.name = cv + " " + str(histogram.name)
      histogram.opacity = 0.8
      fig.add_trace(histogram)

    show_traces.append(traces_to_show)
    axis_labels.append([nv, cv])

steps = []
for i in range(len(show_traces)):
    step = dict(
        method="update",
        args=[{"visible": [False] * len(fig.data)},
              {"title": "Showing in relation to " + axis_labels[i][1],
               "xaxis.title": axis_labels[i][0],
               "yaxis.title": "count"
               }],
    )

    for j in show_traces[i]:
      step["args"][0]["visible"][j] = True

    steps.append(step)

sliders = [dict(
    active=count - 1,
    currentvalue={"prefix": "Frequency: "},
    pad={"t": 50},
    steps=steps
)]

fig.update_layout(
    sliders=sliders,
    barmode="stack"
)

fig.show()

Dependency analysis

Conditioned Boxplots

Helps us visualize how dependent an attribute is on the target.

In [126]:
axis_labels = []

fig = go.Figure()

count = 0

for pred in data.columns[:-1]:
  count += 1

  axis_labels.append(["target", pred])
  fig.add_trace(go.Box(x=data["target"], y=data[pred]))

fig.data[count - 1].visible = True

steps = []
for i in range(len(fig.data)):
    step = dict(
        method="update",
        args=[{"visible": [False] * len(fig.data)},
              {"title": "Slider switched to step: " + str(i),
               "xaxis.title": axis_labels[i][0],
               "yaxis.title": axis_labels[i][1]
               }],
    )
    step["args"][0]["visible"][i] = True
    steps.append(step)

sliders = [dict(
    active=count - 1,
    currentvalue={"prefix": "Frequency: "},
    pad={"t": 50},
    steps=steps
)]

fig.update_layout(
    sliders=sliders
)

fig.show()

Correlation between attributes and target

In [127]:
data.corr(method="kendall")[-1:].style.background_gradient(cmap='Blues')
Out[127]:
  age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal target
target -0.197857 -0.280937 0.430506 -0.102064 -0.099131 -0.028046 0.147678 0.352609 -0.436757 -0.361731 0.361406 -0.430124 -0.392595 1.000000
In [128]:
corr_data = data.corr(method="kendall")[-1:].to_numpy()[0,:-1]

fig = go.Figure()
fig.add_trace(go.Bar(x=data.columns[:-1], y=corr_data))
fig.update_layout(title="Correlation barplot")
fig.show()

Independence tests for target

Chi-squared Independence test

  • We can see that the sex, restecg, exang, cp, slope, ca and thal test results are relevant, while the fbs test result is not.
  • The $\chi^2$ statistic summarizes the discrepancy between the observed and the expected frequencies.
In [129]:
alpha = 0.05

categorical_variables = ["sex", "fbs", "restecg", "exang", "cp", "slope", "ca", "thal"]

for i in range(len(categorical_variables)):
    ct = pd.crosstab(data[categorical_variables[i]], data["target"])  # observed counts
    chi2, p, _, _ = stats.chi2_contingency(ct)
    print(f"Independence between {categorical_variables[i]:<7} and target. Chi-squared test: ")
    print(f"chi2: {chi2:.3f}, p-value: {p:.3f}")
    print(f"{'Failed, p-value bigger than ' + str(alpha) if p > alpha else 'Relevant'}\n")
Independence between sex     and target. Chi-squared test: 
chi2: 22.717, p-value: 0.000
Relevant

Independence between fbs     and target. Chi-squared test: 
chi2: 0.106, p-value: 0.744
Failed, p-value bigger than 0.05

Independence between restecg and target. Chi-squared test: 
chi2: 10.023, p-value: 0.007
Relevant

Independence between exang   and target. Chi-squared test: 
chi2: 55.945, p-value: 0.000
Relevant

Independence between cp      and target. Chi-squared test: 
chi2: 81.686, p-value: 0.000
Relevant

Independence between slope   and target. Chi-squared test: 
chi2: 47.507, p-value: 0.000
Relevant

Independence between ca      and target. Chi-squared test: 
chi2: 74.367, p-value: 0.000
Relevant

Independence between thal    and target. Chi-squared test: 
chi2: 85.304, p-value: 0.000
Relevant

Fisher's test

  • We can see that the sex and exang test results are relevant, while the fbs test result is not.
  • The ratio is the odds ratio estimated from the table.
In [130]:
alpha = 0.05

binary_variables = ["sex", "fbs", "exang"]

for i in range(len(binary_variables)):
    ct = pd.crosstab(data[binary_variables[i]], data["target"])  # 2x2 table
    ratio, p2 = stats.fisher_exact(ct)
    print(f"Independence between {binary_variables[i]:<7} and target. Fisher's exact test:")
    print(f"ratio: {ratio:.3f}, p-value: {p2:.3f}")
    print(f"{'Failed, p-value bigger than ' + str(alpha) if p2 > alpha else 'Relevant'}\n")
Independence between sex     and target. Fisher's exact test:
ratio: 0.272, p-value: 0.000
Relevant

Independence between fbs     and target. Fisher's exact test:
ratio: 0.854, p-value: 0.631
Failed, p-value bigger than 0.05

Independence between exang   and target. Fisher's exact test:
ratio: 0.132, p-value: 0.000
Relevant

T-test

  • We can see that the t-values obtained by splitting on sex and exang are relevant, while those obtained by splitting on fbs are not.
  • The t-values quantify the difference between the arithmetic means of the two samples.
In [131]:
binary_variables = ["sex", "fbs", "exang"]

for column in binary_variables:
    group_1 = data["target"][data[column] == 1]
    group_2 = data["target"][data[column] == 0]
    t, p = stats.ttest_ind(group_1, group_2)
    print(f"Variable target separated by {column:<5} has t-value: {t:.3f} and p-value: {p:.3f}.")
    print(f"{'Failed, p-value bigger than 0.05' if p > 0.05 else 'Relevant'}\n")
Variable target separated by sex   has t-value: -5.079 and p-value: 0.000.
Relevant

Variable target separated by fbs   has t-value: -0.487 and p-value: 0.627.
Failed, p-value bigger than 0.05

Variable target separated by exang has t-value: -8.423 and p-value: 0.000.
Relevant

Conclusion

  • Among the binary variables, the ones most likely to influence the target are:
    • exang, according to all three tests
    • sex, according to all three tests
  • Among the other categorical variables, the ones most likely to influence the target are:
    • cp, according to the Chi-squared test
    • ca, according to the Chi-squared test
    • thal, according to the Chi-squared test
    • slope, according to the Chi-squared test
  • Among the numerical variables, the ones most likely to influence the target are:
    • oldpeak, according to the correlation matrix and barplots
    • thalach, according to the correlation matrix and barplots